To address the adaptive aggregation of local and global contextual information in traffic scene parsing, a Local and Global Context Attentive Fusion Network (LGCAFN) with a three-module architecture is proposed. The front-end feature extraction module consists of an improved 101-layer Residual Network (ResNet-101) based on a Cascaded Atrous Spatial Pyramid Pooling (CASPP) unit, which extracts the multi-scale local features of objects more effectively. The mid-end structural learning module is composed of eight Long Short-Term Memory (LSTM) branches, which infer the spatial structural features of the scene regions adjacent to an object in eight different directions more accurately. The back-end feature fusion module adopts a three-stage fusion method based on an attention mechanism to adaptively aggregate useful contextual information and suppress noisy contextual information; the resulting multi-modal fusion features represent the semantic information of objects more comprehensively and accurately. Experimental results on the Cityscapes standard and extended datasets show that, compared with existing state-of-the-art methods such as the Inverse Transformation Network (ITN) and the Object Contextual Representation Network (OCRN), LGCAFN achieves the best mean Intersection over Union (mIoU), reaching 84.0% and 86.3% respectively. These results indicate that LGCAFN parses traffic scenes accurately and can help realize autonomous driving.
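For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean Intersection over Union can be computed from per-pixel labels. It is an illustration only, not the evaluation code used for the results above: the function name `mean_iou` and its arguments are hypothetical, labels are assumed to be integers in `[0, num_classes)`, and classes absent from both the ground truth and the prediction are assumed to be skipped when averaging.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean IoU over flattened per-pixel class labels (illustrative sketch)."""
    # Confusion matrix: rows index ground-truth class, columns index predicted class.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        conf[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = conf[c][c]                                         # pixels correctly labeled c
        fp = sum(conf[r][c] for r in range(num_classes)) - tp   # predicted c, truly another class
        fn = sum(conf[c]) - tp                                  # truly c, predicted another class
        denom = tp + fp + fn
        if denom > 0:  # assumption: skip classes that never occur in either labeling
            ious.append(tp / denom)

    # Average the per-class IoU values; return 0.0 if no class occurred at all.
    return sum(ious) / len(ious) if ious else 0.0


# Tiny example with two classes and four pixels.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so `score` is their average, 7/12. A benchmark evaluation would normally accumulate the confusion matrix over the whole test set before computing the per-class ratios.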